66 research outputs found
HistoPerm: A Permutation-Based View Generation Approach for Improving Histopathologic Feature Representation Learning
Deep learning has been effective for histology image analysis in digital
pathology. However, many current deep learning approaches require large,
strongly- or weakly-labeled images and regions of interest, which can be
time-consuming and resource-intensive to obtain. To address this challenge, we
present HistoPerm, a view generation method for representation learning using
joint embedding architectures that enhances representation learning for
histology images. HistoPerm permutes augmented views of patches extracted from
whole-slide histology images to improve classification performance. We
evaluated the effectiveness of HistoPerm on two histology image datasets for
Celiac disease and Renal Cell Carcinoma, using three widely used joint
embedding architecture-based representation learning methods: BYOL, SimCLR, and
VICReg. Our results show that HistoPerm consistently improves patch- and
slide-level classification performance in terms of accuracy, F1-score, and AUC.
Specifically, for patch-level classification accuracy on the Celiac disease
dataset, HistoPerm boosts BYOL and VICReg by 8% and SimCLR by 3%. On the Renal
Cell Carcinoma dataset, patch-level classification accuracy is increased by 2%
for BYOL and VICReg, and by 1% for SimCLR. In addition, on the Celiac disease
dataset, models with HistoPerm outperform the fully-supervised baseline model
by 6%, 5%, and 2% for BYOL, SimCLR, and VICReg, respectively. For the Renal
Cell Carcinoma dataset, HistoPerm lowers the classification accuracy gap for
the models up to 10% relative to the fully-supervised baseline. These findings
suggest that HistoPerm can be a valuable tool for improving representation
learning of histopathology features when access to labeled data is limited and
can lead to whole-slide classification results that are comparable to or
superior to fully-supervised methods.
Proto-lm: A Prototypical Network-Based Framework for Built-in Interpretability in Large Language Models
Large Language Models (LLMs) have significantly advanced the field of Natural
Language Processing (NLP), but their lack of interpretability has been a major
concern. Current methods for interpreting LLMs are post hoc, applied after
inference time, and have limitations such as their focus on low-level features
and lack of explainability at higher level text units. In this work, we
introduce proto-lm, a prototypical network-based white-box framework that
allows LLMs to learn immediately interpretable embeddings during the
fine-tuning stage while maintaining competitive performance. Our method's
applicability and interpretability are demonstrated through experiments on a
wide range of NLP tasks, and our results indicate a new possibility of creating
interpretable models without sacrificing performance. This novel approach to
interpretability in LLMs can pave the way for more interpretable models without
the need to sacrifice performance. Comment: Accepted to the Findings of EMNLP 202
Computational Complexity of Bi-clustering.
Bi-clustering, i.e. simultaneously clustering the rows and columns of matrices based on their entries, covers a large variety of techniques in data mining. The goal of all bi-clustering techniques is finding the partitions of the rows and the columns in which sub-rows and sub-columns show a similar behavior. Currently existing algorithms for bi-clustering problems are either heuristic, or try to solve approximations of the original problems. There is no efficient algorithm for exact bi-clustering problems.
The computational complexity of bi-clustering problems depends on the exact problem formulation, and particularly on the merit function used to evaluate the quality of a given bi-clustering partition. The computational complexity of most of the common bi-clustering problems is unknown. In this thesis, we present a formal definition for the homogeneous cover problem. This problem has many applications from bio-informatics to targeted marketing. We analyze its computational complexity and show that the problem is NP-hard.
Predicting prognosis and IDH mutation status for patients with lower-grade gliomas using whole slide images
We developed end-to-end deep learning models using whole slide images of adults diagnosed with diffusely infiltrating, World Health Organization (WHO) grade 2 gliomas to predict prognosis and the mutation status of a somatic biomarker, isocitrate dehydrogenase (IDH) 1/2. The models, which utilize ResNet-18 as a backbone, were developed and validated on 296 patients from The Cancer Genome Atlas (TCGA) database. To account for the small sample size, repeated random train/test splits were performed for hyperparameter tuning, and the out-of-sample predictions were pooled for evaluation. Our models achieved a concordance index (C-index) of 0.715 (95% CI: 0.569, 0.830) for predicting prognosis and an area under the curve (AUC) of 0.667 (95% CI: 0.532, 0.784) for predicting IDH mutations. When combined with additional clinical information, the performance metrics increased to 0.784 (95% CI: 0.655, 0.880) and 0.739 (95% CI: 0.613, 0.856), respectively. When evaluated on the WHO grade 3 gliomas from the TCGA dataset, which were not used for training, our models predicted survival with a C-index of 0.654 (95% CI: 0.537, 0.768) and IDH mutations with an AUC of 0.814 (95% CI: 0.721, 0.897). If validated in a prospective study, our method could potentially assist clinicians in managing and treating patients with diffusely infiltrating gliomas.
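For readers unfamiliar with the C-index reported above, the following is a minimal sketch of Harrell's concordance index for right-censored survival data. It is not the authors' evaluation code: the function name `concordance_index`, its arguments, and the toy values are illustrative assumptions, and pairs with tied observed times are simply skipped for brevity.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell's C-index (illustrative sketch, not the paper's code).

    times  -- observed follow-up times
    events -- 1/True if the event was observed, 0/False if censored
    risks  -- predicted risk scores (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Let `a` be the subject with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Simplification: skip tied times. A pair is only comparable
        # when the shorter time corresponds to an observed event.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0   # higher risk failed first: concordant
        elif risks[a] == risks[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical values, not taken from the study).
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risks = [0.9, 0.7, 0.4, 0.1]
toy_c = concordance_index(toy_times, toy_events, toy_risks)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported values of 0.715 and 0.654 should be read.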
A Semantic-Based Method for Extracting Concept Definitions from Scientific Publications: Evaluation in the Autism Phenotype Domain
Background: A variety of informatics approaches have been developed that use information retrieval, NLP and text-mining techniques to identify biomedical concepts and relations within scientific publications or their sentences. These approaches have not typically addressed the challenge of extracting more complex knowledge such as biomedical definitions. In our efforts to facilitate knowledge acquisition of rule-based definitions of autism phenotypes, we have developed a novel semantic-based text-mining approach that can automatically identify such definitions within text.
Results: Using an existing knowledge base of 156 autism phenotype definitions and an annotated corpus of 26 source articles containing such definitions, we evaluated and compared the average rank of correctly identified rule definition or corresponding rule template using both our semantic-based approach and a standard term-based approach. We examined three separate scenarios: (1) the snippet of text contained a definition already in the knowledge base; (2) the snippet contained an alternative definition for a concept in the knowledge base; and (3) the snippet contained a definition not in the knowledge base. Our semantic-based approach had a higher average rank than the term-based approach for each of the three scenarios (scenario 1: 3.8 vs. 5.0; scenario 2: 2.8 vs. 4.9; and scenario 3: 4.5 vs. 6.2), with each comparison significant at the p-value of 0.05 using the Wilcoxon signed-rank test.
Conclusions: Our work shows that leveraging existing domain knowledge in the information extraction of biomedical definitions significantly improves the correct identification of such knowledge within sentences. Our method can thus help researchers rapidly acquire knowledge about biomedical definitions that are specified and evolving within an ever-growing corpus of scientific publications.
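The paired comparison in the Results relies on the Wilcoxon signed-rank test. As a rough sketch of how its test statistic is formed (not the authors' analysis code), the snippet below computes the positive and negative rank sums for paired observations. The function name `wilcoxon_signed_rank` and the toy values are assumptions for illustration, and the p-value lookup is omitted.

```python
def wilcoxon_signed_rank(x, y):
    """Return (W_plus, W_minus) for paired samples x and y.

    Illustrative sketch: zero differences are dropped and tied
    absolute differences receive the average of their ranks.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        # Find the run of equal absolute differences starting at i.
        j = i
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus


# Toy paired ranks (hypothetical values, not from the evaluation).
method_a = [3, 5, 2, 8]
method_b = [5, 4, 2, 4]
w_plus, w_minus = wilcoxon_signed_rank(method_a, method_b)
```

The smaller of the two sums is then compared against the critical value (or a normal approximation for larger samples) to decide significance at the chosen level.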